Unveiling Vulnerability of Self-Attention
- URL: http://arxiv.org/abs/2402.16470v1
- Date: Mon, 26 Feb 2024 10:31:45 GMT
- Title: Unveiling Vulnerability of Self-Attention
- Authors: Khai Jiet Liong, Hongqiu Wu, Hai Zhao
- Abstract summary: Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes.
This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism.
We introduce S-Attend, a novel smoothing technique that effectively makes SA robust via structural perturbations.
- Score: 61.85150061213987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PLMs) are shown to be vulnerable to minor word
changes, which poses a big threat to real-world systems. While previous studies
directly focus on manipulating word inputs, they are limited by their means of
generating adversarial samples, lacking generalization to versatile real-world
attacks. This paper studies the basic structure of transformer-based PLMs, the
self-attention (SA) mechanism. (1) We propose a powerful perturbation technique
\textit{HackAttend}, which perturbs the attention scores within the SA matrices
via meticulously crafted attention masks. We show that state-of-the-art PLMs
are heavily vulnerable: minor attention perturbations $(1\%)$ can produce a
very high attack success rate $(98\%)$. Our paper expands the
conventional text attack of word perturbations to more general structural
perturbations. (2) We introduce \textit{S-Attend}, a novel smoothing technique
that effectively makes SA robust via structural perturbations. We empirically
demonstrate that this simple yet effective technique achieves robust
performance on par with adversarial training when facing various text
attackers. Code is publicly available at \url{github.com/liongkj/HackAttend}.
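To make the structural-perturbation idea concrete, here is a minimal sketch of the general mechanism HackAttend builds on: suppressing a small budget of attention links with a crafted mask before the softmax. This is not the released implementation; the shapes, the perturb_attention helper, and the top-k selection rule are illustrative assumptions (the paper crafts its masks adversarially, not by logit magnitude).

```python
# Sketch only: perturb self-attention by masking a small fraction of links.
# Assumption: the selection rule (largest logits first) is a placeholder for
# the meticulously crafted masks described in the paper.
import torch
import torch.nn.functional as F

def perturb_attention(scores: torch.Tensor, budget: float = 0.01) -> torch.Tensor:
    """scores: raw attention logits, shape (heads, seq_len, seq_len)."""
    heads, q_len, k_len = scores.shape
    k = max(1, int(budget * q_len * k_len))      # links to cut per head (~1% budget)
    flat = scores.reshape(heads, -1)
    _, idx = flat.topk(k, dim=-1)                # placeholder saliency criterion
    mask = torch.zeros_like(flat)
    mask.scatter_(-1, idx, float("-inf"))        # -inf removes the link after softmax
    return F.softmax((flat + mask).reshape(heads, q_len, k_len), dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    raw = torch.randn(12, 16, 16)                # 12 heads, sequence length 16
    attn = perturb_attention(raw, budget=0.01)
    print(attn.shape, bool(torch.isclose(attn.sum(-1), torch.ones(12, 16)).all()))
```

Swapping the top-k rule for a gradient- or search-based selection and injecting the resulting mask into each transformer layer would recover the kind of structural attack the abstract describes.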
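In the same spirit, a minimal sketch of the smoothing idea behind S-Attend, under the assumption that robustness comes from training with random structural perturbations of the attention matrix; the SmoothedAttention module and its drop_ratio parameter are hypothetical names, not the released API.

```python
# Sketch only: smooth self-attention by randomly dropping a small fraction of
# attention links at training time, so no prediction depends on a single link.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothedAttention(nn.Module):
    def __init__(self, drop_ratio: float = 0.05):
        super().__init__()
        self.drop_ratio = drop_ratio             # fraction of links dropped per step

    def forward(self, q, k, v):
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        if self.training and self.drop_ratio > 0:
            # Bernoulli mask over links; at this ratio, fully masked rows are
            # vanishingly unlikely for realistic sequence lengths.
            drop = torch.rand_like(scores) < self.drop_ratio
            scores = scores.masked_fill(drop, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    attn = SmoothedAttention(drop_ratio=0.05).train()
    q = k = v = torch.randn(2, 8, 16, 64)        # batch, heads, seq_len, head_dim
    print(attn(q, k, v).shape)                   # torch.Size([2, 8, 16, 64])
```

At inference the module runs unperturbed (attn.eval()); the perturbation only regularizes training, mirroring the smoothing-during-fine-tuning idea the abstract describes.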
Related papers
- Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models [32.23201683108716]
We propose a novel strategy that exclusively employs image patches for attacks, thus preserving the integrity of the original text.
Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations.
Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate.
arXiv Detail & Related papers (2024-10-07T10:06:01Z)
- CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models [12.386141652094999]
Existing certified robustness based on random smoothing has shown considerable promise in certifying the input-specific text perturbations.
A naive method is to simply increase the masking ratio and the likelihood of masking attack tokens, but it leads to a significant reduction in both certified accuracy and the certified radius.
We introduce a novel approach, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking.
arXiv Detail & Related papers (2024-06-04T01:02:22Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- Few-Shot Adversarial Prompt Learning on Vision-Language Models [62.50622628004134]
The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention.
Previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision.
We propose a few-shot adversarial prompt framework where adapting input sequences with limited data makes significant adversarial robustness improvement.
arXiv Detail & Related papers (2024-03-21T18:28:43Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- Sparse and Transferable Universal Singular Vectors Attack [5.498495800909073]
We propose a novel sparse universal white-box adversarial attack.
Our approach is based on truncated power iteration, providing sparsity to $(p,q)$-singular vectors of the hidden layers' Jacobian matrices.
Our findings demonstrate the vulnerability of state-of-the-art models to sparse attacks and highlight the importance of developing robust machine learning systems.
arXiv Detail & Related papers (2024-01-25T09:21:29Z)
- SemAttack: Natural Textual Attacks via Different Semantic Spaces [26.97034787803082]
We propose an efficient framework to generate natural adversarial text by constructing different semantic perturbation functions.
We show that SemAttack is able to generate adversarial texts for different languages with high attack success rates.
Our generated adversarial texts are natural and barely affect human performance.
arXiv Detail & Related papers (2022-05-03T03:44:03Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- MixNet for Generalized Face Presentation Attack Detection [63.35297510471997]
We have proposed a deep learning-based network, termed MixNet, to detect presentation attacks.
The proposed algorithm utilizes state-of-the-art convolutional neural network architectures and learns the feature mapping for each attack category.
arXiv Detail & Related papers (2020-10-25T23:01:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.