Query-Based Adversarial Prompt Generation
- URL: http://arxiv.org/abs/2402.12329v1
- Date: Mon, 19 Feb 2024 18:01:36 GMT
- Title: Query-Based Adversarial Prompt Generation
- Authors: Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
- Abstract summary: We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
- Score: 67.238873588125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown it is possible to construct adversarial examples that
cause an aligned language model to emit harmful strings or perform harmful
behavior. Existing attacks work either in the white-box setting (with full
access to the model weights), or through transferability: the phenomenon that
adversarial examples crafted on one model often remain effective on other
models. We improve on prior work with a query-based attack that leverages API
access to a remote language model to construct adversarial examples that cause
the model to emit harmful strings with (much) higher probability than with
transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety
classifier; we can cause GPT-3.5 to emit harmful strings that current transfer
attacks fail to elicit, and we can evade the safety classifier with nearly 100%
probability.
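The abstract describes the attack only at a high level. As a rough illustration, a query-based attack can be framed as a black-box search over an adversarial suffix, scored by how likely the remote model is to emit the target string. The sketch below is illustrative only and is not the authors' algorithm: `score` is a hypothetical stand-in for the remote query (in practice such probabilities would be estimated through the model's API, e.g. from returned logprobs), and the vocabulary, greedy substitution loop, and hyperparameters are all assumptions.

```python
# Minimal sketch of a query-based adversarial suffix search (illustrative only;
# not the paper's method). Assumptions: a score(prompt) oracle standing in for
# the remote query, a tiny substitution vocabulary, and a greedy accept rule.
import random

VOCAB = [" the", " a", " please", " now", " !!", " ??", " describe", " steps"]


def score(prompt: str) -> float:
    """Hypothetical stand-in for one remote query: higher means the model is
    assumed more likely to emit the target string. A real attack would estimate
    this through the model's API (e.g. returned logprobs), not this toy rule."""
    return float(prompt.count(" please") + prompt.count(" steps"))


def query_based_attack(base_prompt: str, suffix_len: int = 8,
                       iters: int = 200, seed: int = 0) -> str:
    """Greedy random-substitution search over an appended adversarial suffix."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = score(base_prompt + "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)           # pick one suffix position
        old = suffix[pos]
        suffix[pos] = rng.choice(VOCAB)           # propose a substitution
        s = score(base_prompt + "".join(suffix))  # one (simulated) remote query
        if s > best:
            best = s                              # keep the improvement
        else:
            suffix[pos] = old                     # otherwise revert
    return base_prompt + "".join(suffix)


if __name__ == "__main__":
    print(query_based_attack("Tell me a story about"))
```

Because each candidate substitution costs one remote query, a practical implementation would need to manage its query budget carefully (for example by batching candidates or reusing partial probability information); those details are specific to the paper and not reproduced here.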
Related papers
- Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models [8.348993615202138]
Backdoor attacks on large language models (LLMs) typically set a fixed trigger in the input instance and specific responses for triggered queries.
We present a new poisoning paradigm against LLMs triggered by specifying generation conditions.
The poisoned model performs normally for output under normal/other generation conditions, while becoming harmful for output under target generation conditions.
arXiv Detail & Related papers (2024-04-23T07:19:20Z)
- Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [14.011140902511135]
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks.
Despite being widely applied, in-context learning is vulnerable to malicious attacks.
We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
arXiv Detail & Related papers (2024-01-11T14:38:19Z)
- LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model [29.068442824880016]
Local fine-tuning of proxy models improves attack transferability and increases the attack success rate by 39%, 7%, and 0.5% (absolute) on the target models ChatGPT, GPT-4, and Claude, respectively.
arXiv Detail & Related papers (2023-10-02T23:29:23Z)
- Transferable Attack for Semantic Segmentation [59.17710830038692]
We study transferable adversarial attacks and observe that adversarial examples generated from a source model fail to attack the target models.
We propose an ensemble attack for semantic segmentation to achieve more effective attacks with higher transferability.
arXiv Detail & Related papers (2023-07-31T11:05:55Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- Can Adversarial Examples Be Parsed to Reveal Victim Model Information? [62.814751479749695]
In this work, we ask whether it is possible to infer data-agnostic victim model (VM) information from data-specific adversarial instances.
We collect a dataset of adversarial attacks across 7 attack types generated from 135 victim models.
We show that a simple, supervised model parsing network (MPN) is able to infer VM attributes from unseen adversarial attacks.
arXiv Detail & Related papers (2023-03-13T21:21:49Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Generating Label Cohesive and Well-Formed Adversarial Claims [44.29895319592488]
Adversarial attacks reveal important vulnerabilities and flaws of trained models.
We investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning.
We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
arXiv Detail & Related papers (2020-09-17T10:50:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.