COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in
  Language Models
- URL: http://arxiv.org/abs/2306.05659v3
- Date: Thu, 14 Sep 2023 03:23:34 GMT
- Title: COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in
  Language Models
- Authors: Zihao Tan, Qingliang Chen, Wenbin Zhu and Yongjian Huang
- Abstract summary: We propose a prompt-based adversarial attack on manual templates in black-box scenarios.
We first design character-level and word-level approaches to break manual templates separately.
We then present a greedy algorithm for the attack based on these destructive approaches.
- Score: 4.776465250559034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Prompt-based learning has proved to be an effective approach for pre-trained
language models (PLMs), especially in low-resource scenarios such as few-shot
settings. However, the trustworthiness of PLMs is of paramount importance, and
prompt-based templates have been shown to contain vulnerabilities that can
mislead the predictions of language models, raising serious security concerns.
In this paper, we shed light on some of these vulnerabilities by proposing
a prompt-based adversarial attack on manual templates in black-box scenarios.
First, we design character-level and word-level heuristic approaches to
break manual templates separately. Then we present a greedy algorithm for the
attack based on these heuristic destructive approaches. Finally, we
evaluate our approach on classification tasks with three variants of BERT
series models and eight datasets. Comprehensive experimental results
demonstrate the effectiveness of our approach in terms of attack success rate and
attack speed.
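
The abstract names three ingredients: character-level perturbations, word-level perturbations, and a greedy search that only queries the model as a black box. The sketch below illustrates how such pieces could fit together; it is not the authors' implementation. The specific perturbations (delete, duplicate, swap), the protected tokens `{text}` and `[MASK]`, the `max_edits` budget, and the `toy_score` stand-in for a victim model are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical black-box interface: maps a template string to the victim
# model's accuracy (in percent) on some evaluation set. Lower = stronger attack.
ScoreFn = Callable[[str], float]

# Assumed placeholder tokens that the attack must leave untouched.
PROTECTED = {"{text}", "[MASK]"}


def char_candidates(word: str) -> List[str]:
    """Character-level edits of one word: delete, duplicate, swap neighbours."""
    out = []
    for i in range(len(word)):
        out.append(word[:i] + word[i + 1:])                # delete char i
        out.append(word[:i] + word[i] * 2 + word[i + 1:])  # duplicate char i
        if i + 1 < len(word):                              # swap chars i, i+1
            out.append(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return [w for w in out if w and w != word]


def candidates(word: str) -> List[List[str]]:
    """All replacements for one token, each expressed as a list of tokens."""
    char_level = [[w] for w in char_candidates(word)]
    word_level = [[], [word, word]]  # delete the word, or repeat it
    return char_level + word_level


def greedy_attack(template: str, score_fn: ScoreFn,
                  max_edits: int = 3) -> Tuple[str, float]:
    """Greedily apply the single edit that lowers the score most, per round."""
    tokens = template.split()
    best = score_fn(template)
    for _ in range(max_edits):
        best_tokens = None
        for i, tok in enumerate(tokens):
            if tok in PROTECTED:
                continue
            for cand in candidates(tok):
                trial = tokens[:i] + cand + tokens[i + 1:]
                score = score_fn(" ".join(trial))  # one black-box query
                if score < best:
                    best, best_tokens = score, trial
        if best_tokens is None:  # no edit helps any more: stop early
            break
        tokens = best_tokens
    return " ".join(tokens), best


def toy_score(template: str) -> float:
    """Stand-in for a real model: accuracy drops when 'It' or 'was' is broken."""
    words = template.split()
    return float(50 + 20 * ("It" in words) + 20 * ("was" in words))


adv_template, adv_score = greedy_attack("{text} It was [MASK] .", toy_score)
```

With a real victim model, `score_fn` would wrap a batch of prompt-based predictions, and the number of calls to it would be a natural proxy for the attack speed the abstract refers to.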
 
      
Related papers
- A Survey on Model Extraction Attacks and Defenses for Large Language Models [55.60375624503877]
 Model extraction attacks pose significant security threats to deployed language models.
This survey provides a comprehensive taxonomy of extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks.
We examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios.
 arXiv  Detail & Related papers  (2025-06-26T22:02:01Z)
- Robustness of Large Language Models Against Adversarial Attacks [5.312946761836463]
 We present a comprehensive study on the robustness of the GPT family of LLMs.
We employ two distinct evaluation methods to assess their resilience.
Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks.
 arXiv  Detail & Related papers  (2024-12-22T13:21:15Z)
- Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability [44.99833362998488]
 Large Language Models (LLMs) have shown impressive performance across a wide range of tasks.
LLMs in particular are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the output of the model.
We propose a method, based on Mechanistic Interpretability (MI) techniques, to guide this process.
 arXiv  Detail & Related papers  (2024-07-29T09:55:34Z)
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
 We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
 Empirical evaluations conducted on different datasets validate the efficacy of our approach.
 arXiv  Detail & Related papers  (2024-06-13T15:55:04Z)
- Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
 Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
 arXiv  Detail & Related papers  (2024-06-05T13:06:33Z)
- Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [14.011140902511135]
 In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks.
Despite being widely applied, in-context learning is vulnerable to malicious attacks.
We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
 arXiv  Detail & Related papers  (2024-01-11T14:38:19Z)
- Defending Pre-trained Language Models as Few-shot Learners against
  Backdoor Attacks [72.03945355787776]
 We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
 arXiv  Detail & Related papers  (2023-09-23T04:41:55Z)
- Defense-Prefix for Preventing Typographic Attacks on CLIP [14.832208701208414]
 Some adversarial attacks fool a model into false or absurd classifications.
We introduce our simple yet effective method: Defense-Prefix (DP), which inserts the DP token before a class name to make words "robust" against typographic attacks.
Our method significantly improves the accuracy of classification tasks for typographic attack datasets, while maintaining the zero-shot capabilities of the model.
 arXiv  Detail & Related papers  (2023-04-10T11:05:20Z)
- PromptAttack: Prompt-based Attack for Language Models via Gradient
  Search [24.42194796252163]
 We observe that the prompt learning methods are vulnerable and can easily be attacked by some illegally constructed prompts.
In this paper, we propose a malicious prompt template construction method (PromptAttack) to probe the security performance of PLMs.
 arXiv  Detail & Related papers  (2022-09-05T10:28:20Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and
  Benchmarks [72.7373468905418]
 We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
 arXiv  Detail & Related papers  (2022-06-17T02:29:23Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
  Language Models [86.02610674750345]
 Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
 arXiv  Detail & Related papers  (2021-11-04T12:59:55Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
 It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a BLEU score of 33.18 on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
 arXiv  Detail & Related papers  (2021-04-16T14:37:27Z)
- Adversarial Attack and Defense of Structured Prediction Models [58.49290114755019]
 In this paper, we investigate attacks and defenses for structured prediction tasks in NLP.
The structured output of structured prediction models is sensitive to small perturbations in the input.
We propose a novel and unified framework that learns to attack a structured prediction model using a sequence-to-sequence model.
 arXiv  Detail & Related papers  (2020-10-04T15:54:03Z)