Unveiling Vulnerability of Self-Attention
- URL: http://arxiv.org/abs/2402.16470v1
- Date: Mon, 26 Feb 2024 10:31:45 GMT
- Title: Unveiling Vulnerability of Self-Attention
- Authors: Khai Jiet Liong, Hongqiu Wu, Hai Zhao
- Abstract summary: Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes.
This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism.
We introduce S-Attend, a novel smoothing technique that effectively makes SA robust via structural perturbations.
- Score: 61.85150061213987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PLMs) are shown to be vulnerable to minor word
changes, which poses a big threat to real-world systems. While previous studies
directly focus on manipulating word inputs, they are limited by their means of
generating adversarial samples, lacking generalization to versatile real-world
attacks. This paper studies the basic structure of transformer-based PLMs, the
self-attention (SA) mechanism. (1) We propose a powerful perturbation technique
\textit{HackAttend}, which perturbs the attention scores within the SA matrices
via meticulously crafted attention masks. We show that state-of-the-art PLMs
are heavily vulnerable: minor attention perturbations $(1\%)$ can produce a
very high attack success rate $(98\%)$. Our paper expands the
conventional text attack of word perturbations to more general structural
perturbations. (2) We introduce \textit{S-Attend}, a novel smoothing technique
that effectively makes SA robust via structural perturbations. We empirically
demonstrate that this simple yet effective technique achieves robust
performance on par with adversarial training when facing various text
attackers. Code is publicly available at \url{github.com/liongkj/HackAttend}.
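To make the structural-perturbation idea concrete, here is a minimal sketch of the general mechanism HackAttend builds on: suppressing a small budget of attention links with a crafted mask before the softmax. This is not the released implementation; the shapes, the perturb_attention helper, and the top-k selection rule are illustrative assumptions (the paper crafts its masks adversarially, not by logit magnitude).

```python
# Sketch only: perturb self-attention by masking a small fraction of links.
# Assumption: the selection rule (largest logits first) is a placeholder for
# the meticulously crafted masks described in the paper.
import torch
import torch.nn.functional as F

def perturb_attention(scores: torch.Tensor, budget: float = 0.01) -> torch.Tensor:
    """scores: raw attention logits, shape (heads, seq_len, seq_len)."""
    heads, q_len, k_len = scores.shape
    k = max(1, int(budget * q_len * k_len))      # links to cut per head (~1% budget)
    flat = scores.reshape(heads, -1)
    _, idx = flat.topk(k, dim=-1)                # placeholder saliency criterion
    mask = torch.zeros_like(flat)
    mask.scatter_(-1, idx, float("-inf"))        # -inf removes the link after softmax
    return F.softmax((flat + mask).reshape(heads, q_len, k_len), dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    raw = torch.randn(12, 16, 16)                # 12 heads, sequence length 16
    attn = perturb_attention(raw, budget=0.01)
    print(attn.shape, bool(torch.isclose(attn.sum(-1), torch.ones(12, 16)).all()))
```

Swapping the top-k rule for a gradient- or search-based selection and injecting the resulting mask into each transformer layer would recover the kind of structural attack the abstract describes.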
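In the same spirit, a minimal sketch of the smoothing idea behind S-Attend, under the assumption that robustness comes from training with random structural perturbations of the attention matrix; the SmoothedAttention module and its drop_ratio parameter are hypothetical names, not the released API.

```python
# Sketch only: smooth self-attention by randomly dropping a small fraction of
# attention links at training time, so no prediction depends on a single link.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothedAttention(nn.Module):
    def __init__(self, drop_ratio: float = 0.05):
        super().__init__()
        self.drop_ratio = drop_ratio             # fraction of links dropped per step

    def forward(self, q, k, v):
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        if self.training and self.drop_ratio > 0:
            # Bernoulli mask over links; at this ratio, fully masked rows are
            # vanishingly unlikely for realistic sequence lengths.
            drop = torch.rand_like(scores) < self.drop_ratio
            scores = scores.masked_fill(drop, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    attn = SmoothedAttention(drop_ratio=0.05).train()
    q = k = v = torch.randn(2, 8, 16, 64)        # batch, heads, seq_len, head_dim
    print(attn(q, k, v).shape)                   # torch.Size([2, 8, 16, 64])
```

At inference the module runs unperturbed (attn.eval()); the perturbation only regularizes training, mirroring the smoothing-during-fine-tuning idea the abstract describes.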
Related papers
- Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models [32.23201683108716]
We propose a novel strategy that exclusively employs image patches for attacks, thus preserving the integrity of the original text.
Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations.
Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate.
arXiv Detail & Related papers (2024-10-07T10:06:01Z)
- CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models [12.386141652094999]
Existing certified robustness based on random smoothing has shown considerable promise in certifying the input-specific text perturbations.
A naive method is to simply increase the masking ratio and the likelihood of masking attack tokens, but it leads to a significant reduction in both certified accuracy and the certified radius.
We introduce a novel approach, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking.
arXiv Detail & Related papers (2024-06-04T01:02:22Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- Few-Shot Adversarial Prompt Learning on Vision-Language Models [62.50622628004134]
The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention.
Previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision.
We propose a few-shot adversarial prompt framework where adapting input sequences with limited data makes significant adversarial robustness improvement.
arXiv Detail & Related papers (2024-03-21T18:28:43Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- Sparse and Transferable Universal Singular Vectors Attack [5.498495800909073]
We propose a novel sparse universal white-box adversarial attack.
Our approach is based on truncated power iteration, providing sparsity to $(p,q)$-singular vectors of the hidden layers' Jacobian matrices.
Our findings demonstrate the vulnerability of state-of-the-art models to sparse attacks and highlight the importance of developing robust machine learning systems.
arXiv Detail & Related papers (2024-01-25T09:21:29Z)
- SemAttack: Natural Textual Attacks via Different Semantic Spaces [26.97034787803082]
We propose an efficient framework to generate natural adversarial text by constructing different semantic perturbation functions.
We show that SemAttack is able to generate adversarial texts for different languages with high attack success rates.
Our generated adversarial texts are natural and barely affect human performance.
arXiv Detail & Related papers (2022-05-03T03:44:03Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- MixNet for Generalized Face Presentation Attack Detection [63.35297510471997]
We have proposed a deep learning-based network, termed MixNet, to detect presentation attacks.
The proposed algorithm utilizes state-of-the-art convolutional neural network architectures and learns the feature mapping for each attack category.
arXiv Detail & Related papers (2020-10-25T23:01:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.