Query-Based Adversarial Prompt Generation
- URL: http://arxiv.org/abs/2402.12329v1
- Date: Mon, 19 Feb 2024 18:01:36 GMT
- Title: Query-Based Adversarial Prompt Generation
- Authors: Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
- Abstract summary: We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
- Score: 67.238873588125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown it is possible to construct adversarial examples that
cause an aligned language model to emit harmful strings or perform harmful
behavior. Existing attacks work either in the white-box setting (with full
access to the model weights), or through transferability: the phenomenon that
adversarial examples crafted on one model often remain effective on other
models. We improve on prior work with a query-based attack that leverages API
access to a remote language model to construct adversarial examples that cause
the model to emit harmful strings with (much) higher probability than with
transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety
classifier; we can cause GPT-3.5 to emit harmful strings that current transfer
attacks fail to elicit, and we can evade the safety classifier with nearly 100%
probability.
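The abstract describes the attack only at a high level. As a rough illustration, a query-based attack can be framed as a black-box search over an adversarial suffix, scored by how likely the remote model is to emit the target string. The sketch below is illustrative only and is not the authors' algorithm: `score` is a hypothetical stand-in for the remote query (in practice such probabilities would be estimated through the model's API, e.g. from returned logprobs), and the vocabulary, greedy substitution loop, and hyperparameters are all assumptions.

```python
# Minimal sketch of a query-based adversarial suffix search (illustrative only;
# not the paper's method). Assumptions: a score(prompt) oracle standing in for
# the remote query, a tiny substitution vocabulary, and a greedy accept rule.
import random

VOCAB = [" the", " a", " please", " now", " !!", " ??", " describe", " steps"]


def score(prompt: str) -> float:
    """Hypothetical stand-in for one remote query: higher means the model is
    assumed more likely to emit the target string. A real attack would estimate
    this through the model's API (e.g. returned logprobs), not this toy rule."""
    return float(prompt.count(" please") + prompt.count(" steps"))


def query_based_attack(base_prompt: str, suffix_len: int = 8,
                       iters: int = 200, seed: int = 0) -> str:
    """Greedy random-substitution search over an appended adversarial suffix."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = score(base_prompt + "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)           # pick one suffix position
        old = suffix[pos]
        suffix[pos] = rng.choice(VOCAB)           # propose a substitution
        s = score(base_prompt + "".join(suffix))  # one (simulated) remote query
        if s > best:
            best = s                              # keep the improvement
        else:
            suffix[pos] = old                     # otherwise revert
    return base_prompt + "".join(suffix)


if __name__ == "__main__":
    print(query_based_attack("Tell me a story about"))
```

Because each candidate substitution costs one remote query, a practical implementation would need to manage its query budget carefully (for example by batching candidates or reusing partial probability information); those details are specific to the paper and not reproduced here.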
Related papers
- Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models [8.348993615202138]
Backdoor attacks on large language models (LLMs) typically set a fixed trigger in the input instance and specific responses for triggered queries.
We present a new poisoning paradigm against LLMs triggered by specifying generation conditions.
The poisoned model performs normally for output under normal/other generation conditions, while becoming harmful for output under target generation conditions.
arXiv Detail & Related papers (2024-04-23T07:19:20Z)
- Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [14.011140902511135]
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks.
Despite being widely applied, in-context learning is vulnerable to malicious attacks.
We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
arXiv Detail & Related papers (2024-01-11T14:38:19Z)
- LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model [29.068442824880016]
Local fine-tuning of proxy models improves attack transferability and increases the attack success rate by 39%, 7%, and 0.5% (absolute) on the target models ChatGPT, GPT-4, and Claude, respectively.
arXiv Detail & Related papers (2023-10-02T23:29:23Z)
- Transferable Attack for Semantic Segmentation [59.17710830038692]
We study transferable adversarial attacks and observe that adversarial examples generated from a source model fail to attack the target models.
We propose an ensemble attack for semantic segmentation to achieve more effective attacks with higher transferability.
arXiv Detail & Related papers (2023-07-31T11:05:55Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- Can Adversarial Examples Be Parsed to Reveal Victim Model Information? [62.814751479749695]
In this work, we ask whether it is possible to infer data-agnostic victim model (VM) information from data-specific adversarial instances.
We collect a dataset of adversarial attacks across 7 attack types generated from 135 victim models.
We show that a simple, supervised model parsing network (MPN) is able to infer VM attributes from unseen adversarial attacks.
arXiv Detail & Related papers (2023-03-13T21:21:49Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Generating Label Cohesive and Well-Formed Adversarial Claims [44.29895319592488]
Adversarial attacks reveal important vulnerabilities and flaws of trained models.
We investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning.
We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
arXiv Detail & Related papers (2020-09-17T10:50:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.