DALA: A Distribution-Aware LoRA-Based Adversarial Attack against
Language Models
- URL: http://arxiv.org/abs/2311.08598v2
- Date: Sat, 17 Feb 2024 19:14:20 GMT
- Title: DALA: A Distribution-Aware LoRA-Based Adversarial Attack against
Language Models
- Authors: Yibo Wang, Xiangjue Dong, James Caverlee, Philip S. Yu
- Abstract summary: Adversarial attacks can introduce subtle perturbations to input data.
Recent attack methods can achieve a relatively high attack success rate (ASR).
We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
- Score: 64.79319733514266
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language models (LMs) can be manipulated by adversarial attacks, which
introduce subtle perturbations to input data. While recent attack methods can
achieve a relatively high attack success rate (ASR), we have observed that the
generated adversarial examples have a different data distribution compared with
the original examples. Specifically, these adversarial examples exhibit reduced
confidence levels and greater divergence from the training data distribution.
Consequently, they are easy to detect using straightforward detection methods,
diminishing the efficacy of such attacks. To address this issue, we propose a
Distribution-Aware LoRA-based Adversarial Attack (DALA) method. DALA considers
distribution shifts of adversarial examples to improve the attack's
effectiveness under detection methods. We further design a novel evaluation
metric, the Non-detectable Attack Success Rate (NASR), which integrates both
ASR and detectability for the attack task. We conduct experiments on four
widely used datasets to validate the attack effectiveness and transferability
of adversarial examples generated by DALA against both the white-box BERT-base
model and the black-box LLaMA2-7b model. Our code is available at
https://anonymous.4open.science/r/DALA-A16D/.
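The abstract does not spell out the NASR formula here; as a rough illustration only, the sketch below assumes NASR counts an adversarial example as a success only if it both flips the victim model's prediction and slips past a detector, whereas ASR counts every flip. The helper names (`victim_predict`, `detector_flags`) are hypothetical stand-ins, not the DALA codebase's API.

```python
from typing import Callable, List, Tuple

def asr_and_nasr(
    pairs: List[Tuple[str, str]],          # (original, adversarial) text pairs
    true_labels: List[int],
    victim_predict: Callable[[str], int],  # hypothetical victim-model wrapper
    detector_flags: Callable[[str], bool], # hypothetical detector: True = "looks adversarial"
) -> Tuple[float, float]:
    """Compute ASR and a NASR-style score over examples the victim originally gets right."""
    attacked = succeeded = undetected = 0
    for (orig, adv), y in zip(pairs, true_labels):
        if victim_predict(orig) != y:
            continue                        # only attack examples the model classifies correctly
        attacked += 1
        if victim_predict(adv) != y:        # attack flips the prediction -> counts toward ASR
            succeeded += 1
            if not detector_flags(adv):     # ...and also evades the detector -> counts toward NASR
                undetected += 1
    asr = succeeded / attacked if attacked else 0.0
    nasr = undetected / attacked if attacked else 0.0
    return asr, nasr
```

Under this reading, NASR can never exceed ASR, and the gap between the two reflects how detectable the generated adversarial examples are.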
Related papers
- Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector [97.92369017531038]
We build a new laRge-scale Adversarial images dataset with Diverse hArmful Responses (RADAR).
We then develop a novel iN-time Embedding-based AdveRSarial Image DEtection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of Vision-Language Models (VLMs) to distinguish adversarial images from benign ones in the input.
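The summary does not describe how that single vector is obtained; purely as an illustrative assumption, the sketch below distills it as the normalized difference of mean hidden states between known adversarial and benign examples and scores new inputs by their projection onto it. The fitting procedure and names are hypothetical, not NEARSIDE's actual method.

```python
import numpy as np

def fit_detection_vector(benign_h: np.ndarray, adv_h: np.ndarray) -> np.ndarray:
    """One plausible 'single vector': the normalized difference of class-mean hidden states.
    benign_h, adv_h: arrays of shape (n_examples, hidden_dim) taken from the VLM."""
    direction = adv_h.mean(axis=0) - benign_h.mean(axis=0)
    return direction / (np.linalg.norm(direction) + 1e-12)

def score(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    """Higher projection onto the adversarial direction -> more likely adversarial."""
    return float(hidden_state @ direction)

# Usage: flag an input as adversarial if its projection exceeds a threshold
# chosen on a held-out validation set.
# is_adv = score(h, direction) > threshold
```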
arXiv Detail & Related papers (2024-10-30T10:33:10Z)
- Fooling SHAP with Output Shuffling Attacks [4.873272103738719]
Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models.
However, adversarial attacks can subvert the detection capabilities of XAI methods.
We propose a novel family of attacks, called shuffling attacks, that are data-agnostic.
arXiv Detail & Related papers (2024-08-12T21:57:18Z)
- DTA: Distribution Transform-based Attack for Query-Limited Scenario [11.874670564015789]
In generating adversarial examples, conventional black-box attack methods rely on ample feedback from the models under attack.
This paper proposes a hard-label attack that simulates a realistic scenario in which the attacker is permitted only a limited number of queries.
Experiments validate the effectiveness of the proposed idea and the superiority of DTA over the state-of-the-art.
arXiv Detail & Related papers (2023-12-12T13:21:03Z)
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)
- Microbial Genetic Algorithm-based Black-box Attack against Interpretable Deep Learning Systems [16.13790238416691]
In white-box environments, interpretable deep learning systems (IDLSes) have been shown to be vulnerable to malicious manipulations.
We propose QuScore, a query-efficient score-based black-box attack against IDLSes, which requires no knowledge of the target model or its coupled interpretation model.
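The entry names a microbial genetic algorithm but gives no operators; the sketch below shows only the generic microbial-GA loop (two candidates compete, the loser is overwritten by a mutated crossover of the winner), with `fitness`, `mutate`, and `crossover` as hypothetical callables that would wrap the black-box scores queried from the target model and its interpreter.

```python
import random

def microbial_ga(init_population, fitness, mutate, crossover, n_rounds=1000):
    """Generic microbial GA: each round, two candidates compete; the loser is
    replaced by a crossed-over, mutated copy of the winner. `fitness` is a
    hypothetical stand-in for the black-box score obtained by querying the model."""
    pop = list(init_population)
    for _ in range(n_rounds):
        i, j = random.sample(range(len(pop)), 2)
        winner, loser = (i, j) if fitness(pop[i]) >= fitness(pop[j]) else (j, i)
        pop[loser] = mutate(crossover(pop[winner], pop[loser]))
    return max(pop, key=fitness)
```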
arXiv Detail & Related papers (2023-07-13T00:08:52Z)
- Versatile Weight Attack via Flipping Limited Bits [68.45224286690932]
We study a novel attack paradigm, which modifies model parameters in the deployment stage.
Considering the effectiveness and stealthiness goals, we provide a general formulation to perform the bit-flip based weight attack.
We present two cases of the general formulation with different malicious purposes, i.e., single sample attack (SSA) and triggered samples attack (TSA).
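The bit-flip primitive behind such weight attacks can be illustrated independently of the paper's formulation; the sketch below only flips one bit in the IEEE-754 encoding of a float32 weight. How the attack selects which weights and bits to flip for effectiveness and stealthiness is not shown here.

```python
import numpy as np

def flip_bit(weight: np.float32, bit: int) -> np.float32:
    """Flip one bit (0 = least significant) in the IEEE-754 encoding of a float32 weight.
    The actual attack searches for which weights/bits to flip under effectiveness
    and stealthiness constraints; this only demonstrates the bit-level primitive."""
    as_int = np.float32(weight).view(np.uint32)
    flipped = as_int ^ np.uint32(1 << bit)
    return flipped.view(np.float32)

# Flipping a high-order exponent bit changes a weight drastically, while flipping
# a low-order mantissa bit is an almost invisible perturbation.
print(flip_bit(np.float32(0.5), 30))   # exponent bit -> huge change
print(flip_bit(np.float32(0.5), 0))    # mantissa LSB -> tiny change
```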
arXiv Detail & Related papers (2022-07-25T03:24:58Z)
- Defending against the Label-flipping Attack in Federated Learning [5.769445676575767]
Federated learning (FL) provides autonomy and privacy by design to participating peers.
The label-flipping (LF) attack is a targeted poisoning attack where the attackers poison their training data by flipping the labels of some examples.
We propose a novel defense that dynamically extracts the most relevant gradients from the peers' local updates.
arXiv Detail & Related papers (2022-07-05T12:02:54Z)
- On Trace of PGD-Like Adversarial Attacks [77.75152218980605]
Adversarial attacks pose safety and security concerns for deep learning applications.
We construct Adversarial Response Characteristics (ARC) features to reflect the model's gradient consistency.
Our method is intuitive, lightweight, non-intrusive, and data-undemanding.
arXiv Detail & Related papers (2022-05-19T14:26:50Z)
- Mitigating the Impact of Adversarial Attacks in Very Deep Networks [10.555822166916705]
Deep Neural Network (DNN) models have security-related vulnerabilities.
Data-poisoning-enabled perturbation attacks are complex adversarial attacks that inject false data into models.
We propose an attack-agnostic defense method to mitigate their influence.
arXiv Detail & Related papers (2020-12-08T21:25:44Z)
- Adversarial Example Games [51.92698856933169]
Adversarial Example Games (AEG) is a framework that models the crafting of adversarial examples.
AEG provides a new way to design adversarial examples by adversarially training a generator and a classifier from a given hypothesis class.
We demonstrate the efficacy of AEG on the MNIST and CIFAR-10 datasets.
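As a rough, hypothetical sketch of such a generator-versus-classifier game (not the paper's actual objective or architecture), one alternating update could look like the following, with `generator`, `classifier`, the optimizers, and the perturbation bound `epsilon` all assumed.

```python
import torch
import torch.nn.functional as F

def aeg_style_step(generator, classifier, gen_opt, clf_opt, x, y, epsilon=0.1):
    """One alternating update of a generator-vs-classifier game (illustrative only):
    the generator proposes bounded perturbations that maximize the classifier's loss;
    the classifier is then trained to resist them."""
    # Generator step: maximize classification loss of perturbed inputs.
    delta = epsilon * torch.tanh(generator(x))          # keep perturbation in [-eps, eps]
    gen_loss = -F.cross_entropy(classifier(x + delta), y)
    gen_opt.zero_grad(); gen_loss.backward(); gen_opt.step()

    # Classifier step: minimize loss on freshly generated adversarial inputs.
    with torch.no_grad():
        delta = epsilon * torch.tanh(generator(x))
    clf_loss = F.cross_entropy(classifier(x + delta), y)
    clf_opt.zero_grad(); clf_loss.backward(); clf_opt.step()
    return gen_loss.item(), clf_loss.item()
```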
arXiv Detail & Related papers (2020-07-01T19:47:23Z)