Adversarial Demonstration Attacks on Large Language Models
- URL: http://arxiv.org/abs/2305.14950v2
- Date: Sat, 14 Oct 2023 05:03:53 GMT
- Title: Adversarial Demonstration Attacks on Large Language Models
- Authors: Jiongxiao Wang, Zichen Liu, Keun Hee Park, Zhuojun Jiang, Zhaoheng
Zheng, Zhuofeng Wu, Muhao Chen, Chaowei Xiao
- Abstract summary: We investigate the security concern of in-context learning (ICL) from an adversarial perspective.
We propose a novel attack method named advICL, which aims to manipulate only the demonstrations, without changing the input, in order to mislead the models.
Our results demonstrate that as the number of demonstrations increases, the robustness of in-context learning decreases.
- Score: 43.15298174675082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of more powerful large language models (LLMs), such as
ChatGPT and GPT-4, in-context learning (ICL) has gained significant prominence
in leveraging these models for specific tasks by utilizing data-label pairs as
precondition prompts. While incorporating demonstrations can greatly enhance
the performance of LLMs across various tasks, it may introduce a new security
concern: attackers can manipulate only the demonstrations without changing the
input to perform an attack. In this paper, we investigate the security concern
of ICL from an adversarial perspective, focusing on the impact of
demonstrations. We propose a novel attack method named advICL, which aims to
manipulate only the demonstrations, without changing the input, in order to
mislead the models. Our results demonstrate that as the number of
demonstrations increases, the robustness of in-context learning decreases.
Additionally, we identify an intrinsic property of demonstrations: they can be
prepended to different inputs. This enables a more practical threat model in
which an attacker can attack a test input example without knowing or
manipulating it. To realize this, we propose a transferable version of advICL,
named Transferable-advICL. Our experiments show that adversarial demonstrations
generated by Transferable-advICL can successfully attack unseen test input
examples. We hope that our study reveals the
critical security risks associated with ICL and underscores the need for
extensive research on the robustness of ICL, particularly given its increasing
significance in the advancement of LLMs.
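To make the demonstration-only threat model concrete, the following is a minimal sketch of the idea, not the authors' advICL implementation: the `classify` callable (a stand-in for querying an LLM), the sentiment prompt template, the random character substitutions, and the greedy search loop are all illustrative assumptions. What it preserves from the paper's setting is that only the demonstration texts are perturbed, the test input is never touched, and the same adversarial demonstrations can be re-evaluated against unseen inputs, as in Transferable-advICL.

```python
# Minimal sketch of a demonstration-only attack on in-context learning.
# NOT the advICL implementation: the classifier stub, the prompt template,
# the random character substitutions, and the greedy search are assumptions.
import random
import string
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (demonstration text, label)

def build_prompt(demos: List[Demo], test_input: str) -> str:
    """Prepend data-label demonstrations to the (unchanged) test input."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(blocks)

def perturb_char(text: str, rng: random.Random) -> str:
    """Replace one random character (a crude stand-in for the
    similarity-constrained perturbations used in the paper)."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + rng.choice(string.ascii_lowercase) + text[i + 1:]

def demo_only_attack(
    classify: Callable[[str], str],   # wraps an LLM query, returns a label
    demos: List[Demo],
    test_input: str,
    true_label: str,
    steps: int = 200,
    seed: int = 0,
) -> List[Demo]:
    """Perturb demonstration texts only (never the test input) until the
    predicted label for test_input no longer matches true_label."""
    rng = random.Random(seed)
    demos = list(demos)
    for _ in range(steps):
        if classify(build_prompt(demos, test_input)) != true_label:
            break  # attack succeeded
        k = rng.randrange(len(demos))
        text, label = demos[k]
        demos[k] = (perturb_char(text, rng), label)  # labels stay intact
    return demos

def transfer_rate(
    classify: Callable[[str], str],
    adv_demos: List[Demo],
    test_set: List[Tuple[str, str]],
) -> float:
    """Fraction of unseen inputs misclassified when prepended with the same
    adversarial demonstrations (the Transferable-advICL setting)."""
    wrong = sum(classify(build_prompt(adv_demos, x)) != y for x, y in test_set)
    return wrong / max(len(test_set), 1)
```

In the paper, the random substitutions would be replaced by an importance-guided, similarity-constrained search and `classify` by a call to the target LLM; the sketch only fixes the interface in which the attacker controls the demonstrations but not the test query.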
Related papers
- Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability [44.99833362998488]
Large Language Models (LLMs) have shown impressive performance across a wide range of tasks.
LLMs are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the model's output.
We propose a method, based on Mechanistic Interpretability (MI) techniques, to guide this process.
arXiv Detail & Related papers (2024-07-29T09:55:34Z)
- Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning [21.018893978967053]
In-Context Learning (ICL) is sensitive to the choice, order, and verbaliser used to encode the demonstrations in the prompt.
Retrieval-Augmented ICL methods try to address this problem by leveraging retrievers to extract semantically related examples as demonstrations.
Our study reveals that retrieval-augmented models can enhance robustness against test sample attacks.
We introduce an effective training-free adversarial defence method, DARD, which enriches the example pool with attacked samples.
arXiv Detail & Related papers (2024-05-24T23:56:36Z)
- Data Poisoning for In-context Learning [49.77204165250528]
In-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks.
This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks.
We introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL.
arXiv Detail & Related papers (2024-02-03T14:20:20Z)
- Comparable Demonstrations are Important in In-Context Learning: A Novel Perspective on Demonstration Selection [22.29452683679149]
In-Context Learning (ICL) is an important paradigm for adapting Large Language Models (LLMs) to downstream tasks through a few demonstrations.
This study explores the ICL mechanisms from a novel perspective, providing a deeper insight into the demonstration selection strategy for ICL.
arXiv Detail & Related papers (2023-12-12T18:05:46Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Hijacking Large Language Models via Adversarial In-Context Learning [8.15194326639149]
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks.
Existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL.
This work introduces a novel transferable attack against ICL to address these issues.
arXiv Detail & Related papers (2023-11-16T15:01:48Z)
- Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z)
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [38.437893814759086]
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses.
arXiv Detail & Related papers (2023-10-10T07:50:29Z)
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)
- Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)