ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
- URL: http://arxiv.org/abs/2403.02910v2
- Date: Wed, 6 Mar 2024 04:29:32 GMT
- Title: ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
- Authors: Xijia Tao, Shuai Zhong, Lei Li, Qi Liu, Lingpeng Kong
- Abstract summary: We propose a novel jailbreaking attack against vision language models (VLMs).
We assume a scenario in which our poisoned (image, text) data pairs are included in the training data.
By replacing the original textual captions with malicious jailbreak prompts, our method can mount jailbreak attacks using the poisoned images.
- Score: 40.55590043993117
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: There has been an increasing interest in the alignment of large language
models (LLMs) with human values. However, the safety issues of their
integration with a vision module, or vision language models (VLMs), remain
relatively underexplored. In this paper, we propose a novel jailbreaking attack
against VLMs, aiming to bypass their safety barrier when a user inputs harmful
instructions. We assume a scenario in which our poisoned (image, text) data
pairs are included in the training data. By replacing the original textual
captions with malicious jailbreak prompts, our method can mount jailbreak
attacks using the poisoned images. Moreover, we analyze the effect of poison
ratios and positions of trainable parameters on our attack's success rate. For
evaluation, we design two metrics to quantify the success rate and the
stealthiness of our attack. Together with a list of curated harmful
instructions, we provide a benchmark for measuring attack efficacy. We
demonstrate the effectiveness of our attack by comparing it with baseline methods.
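To make the poisoning step concrete, the sketch below replaces the captions of a chosen fraction of (image, caption) pairs with a jailbreak prompt. It is a minimal illustration: the function name, the placeholder prompt, and the list-of-pairs layout are assumptions, not the authors' released code.

```python
import random

# Illustrative placeholder; the paper's actual jailbreak prompt is not shown here.
JAILBREAK_PROMPT = "<malicious jailbreak prompt would go here>"

def poison_caption_dataset(pairs, poison_ratio, seed=0):
    """Replace the captions of a `poison_ratio` fraction of (image, caption)
    pairs with the jailbreak prompt, leaving the images untouched."""
    rng = random.Random(seed)
    pairs = list(pairs)
    n_poison = int(len(pairs) * poison_ratio)
    poisoned_idx = set(rng.sample(range(len(pairs)), n_poison))
    return [
        (image, JAILBREAK_PROMPT if i in poisoned_idx else caption)
        for i, (image, caption) in enumerate(pairs)
    ]

# Example: poison 1% of a toy caption dataset.
clean = [(f"img_{i}.jpg", f"a photo of object {i}") for i in range(1000)]
poisoned = poison_caption_dataset(clean, poison_ratio=0.01)
```

Once a VLM is fine-tuned on such a mix, the poisoned images alone are intended to act as jailbreak triggers at inference time, which is what the stealthiness metric is meant to capture.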
Related papers
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
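A rough sketch of the bi-modal loop described above, alternating continuous updates on the visual prompt with an LLM-driven rewrite of the textual prompt. Both callables are hypothetical stand-ins for model-specific code, and plain PGD stands in for the paper's visual-prompt optimizer:

```python
import torch

def bap_attack(vlm_loss, refine_with_llm, image, text_prompt,
               rounds=10, inner_steps=10, eps=8 / 255, alpha=1 / 255):
    """Alternate (1) PGD steps on the image and (2) an LLM-guided rewrite of
    the textual prompt. Assumed interfaces:
    vlm_loss(image, text) -> scalar jailbreak loss to minimize;
    refine_with_llm(text) -> rewritten prompt from chain-of-thought analysis."""
    adv = image.clone()
    for _ in range(rounds):
        for _ in range(inner_steps):
            adv.requires_grad_(True)
            loss = vlm_loss(adv, text_prompt)
            grad, = torch.autograd.grad(loss, adv)
            with torch.no_grad():
                adv = adv - alpha * grad.sign()               # descend loss
                adv = image + (adv - image).clamp(-eps, eps)  # L_inf budget
                adv = adv.clamp(0, 1)                         # valid pixels
        text_prompt = refine_with_llm(text_prompt)
    return adv, text_prompt
```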
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
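The two-stage structure of that attack reads naturally as code. In this sketch, `affirm_nll` is a hypothetical callable returning the negative log-likelihood of an affirmative response given the image (and an optional text suffix); the paper's gradient-guided suffix search is replaced by random coordinate swaps for brevity:

```python
import torch

def whitebox_vlm_attack(affirm_nll, img_shape, vocab_size,
                        stage1_steps=500, stage2_steps=200,
                        lr=0.01, suffix_len=20):
    # Stage 1: optimize an adversarial image prefix from random noise so the
    # model emits harmful content with no text input at all.
    img = torch.rand(img_shape, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(stage1_steps):
        opt.zero_grad()
        affirm_nll(img, None).backward()
        opt.step()
        with torch.no_grad():
            img.clamp_(0, 1)
    # Stage 2: co-optimize a discrete text suffix with the image prefix to
    # maximize the probability of affirmative responses.
    suffix = torch.randint(vocab_size, (suffix_len,))
    for _ in range(stage2_steps):
        opt.zero_grad()
        affirm_nll(img, suffix).backward()
        opt.step()
        with torch.no_grad():
            img.clamp_(0, 1)
            cand = suffix.clone()
            cand[int(torch.randint(suffix_len, (1,)))] = int(
                torch.randint(vocab_size, (1,)))
            if affirm_nll(img, cand) < affirm_nll(img, suffix):
                suffix = cand
    return img.detach(), suffix
```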
- SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models [34.557309967708406]
In this work, we investigate the potential vulnerabilities of instruction-following speech-language models (SLMs) to adversarial attacks and jailbreaking.
We design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement.
Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering tasks, scoring over 80% on both safety and helpfulness metrics.
arXiv Detail & Related papers (2024-05-14T04:51:23Z)
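In the white-box setting, such an attack can be pictured as PGD on the raw waveform. A sketch under assumptions: `refusal_logprob` is a hypothetical scalar-valued stand-in for the speech-language model's refusal likelihood, and the perturbation budget is illustrative:

```python
import torch

def pgd_jailbreak_audio(refusal_logprob, waveform, eps=0.002,
                        alpha=0.0005, steps=100):
    """White-box PGD on the raw waveform: push the SLM away from refusals.
    Minimizing the (assumed) refusal log-probability encourages compliant
    responses while keeping the perturbation within an L_inf budget."""
    adv = waveform.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = refusal_logprob(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()
            adv = waveform + (adv - waveform).clamp(-eps, eps)
        adv = adv.detach()
    return adv
```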
- Learning to Poison Large Language Models During Instruction Tuning [10.450787229190203]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process.
We propose a novel gradient-guided backdoor trigger learning approach to identify adversarial triggers efficiently.
Our strategy demonstrates a high success rate in compromising model outputs.
arXiv Detail & Related papers (2024-02-21T01:30:03Z)
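Gradient-guided trigger learning can be sketched with a HotFlip-style first-order search: rank candidate replacement tokens by the predicted change in the poisoning loss. The shapes and names below are assumptions, not the paper's implementation:

```python
import torch

def rank_trigger_replacements(embed_matrix, trigger_grad, trigger_ids, topk=5):
    """One HotFlip-style step: score each replacement token v at trigger
    position t by the first-order loss change (e_v - e_old_t) . grad_t and
    keep the candidates that most decrease the poisoning loss.
    embed_matrix: (vocab, dim); trigger_grad: (trig_len, dim) gradients of
    the loss w.r.t. the current trigger embeddings; trigger_ids: (trig_len,)."""
    old = embed_matrix[trigger_ids]                             # (T, dim)
    scores = trigger_grad @ embed_matrix.T \
             - (trigger_grad * old).sum(dim=-1, keepdim=True)   # (T, vocab)
    return scores.topk(topk, dim=-1, largest=False).indices     # (T, topk)
```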
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
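The weak-to-strong idea is cheap at decode time: steer a large, aligned model with the log-probability shift between a small jailbroken model and its safe counterpart. A sketch of that arithmetic, assuming all three models share a tokenizer:

```python
import torch

def weak_to_strong_logits(strong_logits, weak_unsafe_logits, weak_safe_logits,
                          alpha=1.0):
    """Amplify the per-token shift a jailbroken weak model induces relative
    to its safe twin, and apply it to the large model's next-token
    distribution. All logits are assumed (vocab,)-shaped and aligned."""
    shift = (torch.log_softmax(weak_unsafe_logits, dim=-1)
             - torch.log_softmax(weak_safe_logits, dim=-1))
    return torch.log_softmax(strong_logits, dim=-1) + alpha * shift

# Usage per decoding step: sample from softmax(weak_to_strong_logits(...)).
```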
- Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers [25.94356063000699]
Backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data.
We focus on more realistic and more challenging clean-label attacks where the adversarial training examples are correctly labeled.
Our attack, LLMBkd, leverages language models to automatically insert diverse style-based triggers into texts.
arXiv Detail & Related papers (2023-10-28T06:11:07Z)
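The clean-label recipe is simple to sketch: labels stay correct, but a subset of training texts is rewritten by an LLM into a distinctive style that later serves as the trigger. `llm_rewrite` is a hypothetical wrapper around any chat-completion call, and the prompt template is illustrative:

```python
def poison_with_style(texts, style, llm_rewrite):
    """Clean-label poisoning in the LLMBkd spirit: labels stay intact, but
    the texts are restyled so the style itself acts as the backdoor trigger.
    llm_rewrite(prompt) is a hypothetical LLM call returning a string."""
    template = ("Rewrite the text below in a {style} style without changing "
                "its meaning:\n\n{text}")
    return [llm_rewrite(template.format(style=style, text=t)) for t in texts]

# Example (illustrative): restyle a subset of correctly labeled reviews.
# poisoned = poison_with_style(target_class_texts, "Biblical", call_llm)
```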
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
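AutoDAN's readable prompts come from balancing two per-token objectives. This is a schematic of that selection criterion only; the attack scores, shapes, and weighting below are assumptions, not the paper's exact formulation:

```python
import torch

def autodan_candidates(attack_scores, lm_logprobs, w_attack=1.0, topk=10):
    """Blend a (hypothetical) gradient-derived attack score per vocabulary
    item with the language model's own next-token log-probabilities, so the
    chosen token both advances the attack and keeps the prompt readable
    (low perplexity). Both inputs are assumed (vocab,)-shaped tensors."""
    combined = lm_logprobs + w_attack * attack_scores
    return combined.topk(topk).indices  # shortlist for exact evaluation
```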
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies of backdoor attacks helps to understand models' vulnerabilities.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
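One way to picture an attention-manipulating objective: add a term that rewards attention mass on trigger positions for poisoned samples. The shapes and the exact form of the loss are assumptions in the spirit of TAL, not the paper's code:

```python
import torch

def trojan_attention_loss(attn, trigger_mask):
    """Reward attention mass on trigger positions for poisoned samples.
    attn: (batch, heads, q_len, k_len) attention weights;
    trigger_mask: (batch, k_len) with 1.0 at trigger token positions."""
    mass = (attn * trigger_mask[:, None, None, :]).sum(dim=-1)  # (B, H, Q)
    return -mass.clamp_min(1e-9).log().mean()

# Poisoned-batch objective (lam is an illustrative weight):
# total_loss = task_loss + lam * trojan_attention_loss(attn, trigger_mask)
```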