Visual Adversarial Examples Jailbreak Aligned Large Language Models
- URL: http://arxiv.org/abs/2306.13213v2
- Date: Wed, 16 Aug 2023 22:38:55 GMT
- Title: Visual Adversarial Examples Jailbreak Aligned Large Language Models
- Authors: Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi
Wang, Prateek Mittal
- Abstract summary: We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
- Score: 66.53468356460365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been a surge of interest in integrating vision into Large
Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as
Flamingo and GPT-4. This paper sheds light on the security and safety
implications of this trend. First, we underscore that the continuous and
high-dimensional nature of the visual input makes it a weak link against
adversarial attacks, representing an expanded attack surface of
vision-integrated LLMs. Second, we highlight that the versatility of LLMs also
presents visual attackers with a wider array of achievable adversarial
objectives, extending the implications of security failures beyond mere
misclassification. As an illustration, we present a case study in which we
exploit visual adversarial examples to circumvent the safety guardrail of
aligned LLMs with integrated vision. Intriguingly, we discover that a single
visual adversarial example can universally jailbreak an aligned LLM, compelling
it to heed a wide range of harmful instructions (that it otherwise would not)
and generate harmful content that transcends the narrow scope of a 'few-shot'
derogatory corpus initially employed to optimize the adversarial example. Our
study underscores the escalating adversarial risks associated with the pursuit
of multimodality. Our findings also connect the long-studied adversarial
vulnerabilities of neural networks to the nascent field of AI alignment. The
presented attack suggests a fundamental adversarial challenge for AI alignment,
especially in light of the emerging trend toward multimodality in frontier
foundation models.
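The case study above optimizes a single adversarial image on a small 'few-shot' derogatory corpus so that, paired with arbitrary harmful instructions at inference time, it defeats the model's safety alignment. A minimal, illustrative sketch of that optimization loop follows; it assumes a HuggingFace-style vision-language model whose forward pass accepts pixel_values, input_ids, and labels and returns a language-modeling loss. The model/tokenizer interface, the corpus, and all hyperparameters are placeholders, not the authors' exact setup.
```python
# Minimal sketch of the universal visual jailbreak described above: one image is
# optimized so the aligned VLM assigns high likelihood to a small "few-shot"
# corpus of harmful text. Interfaces, corpus, and hyperparameters are assumptions.
import torch

def optimize_adversarial_image(vlm, tokenizer, corpus, steps=500,
                               eps=32 / 255, alpha=1 / 255, device="cuda"):
    """PGD-style optimization of one image against a few-shot harmful corpus."""
    vlm.eval().to(device)
    base = torch.full((1, 3, 224, 224), 0.5, device=device)   # benign starting image
    delta = torch.zeros_like(base, requires_grad=True)        # adversarial perturbation

    for _ in range(steps):
        loss = 0.0
        for text in corpus:
            enc = tokenizer(text, return_tensors="pt").to(device)
            # Assumed interface: a causal VLM that returns the usual next-token LM
            # loss when labels == input_ids. In practice the loss would be restricted
            # to the harmful target tokens rather than the whole sequence.
            out = vlm(pixel_values=(base + delta).clamp(0, 1),
                      input_ids=enc["input_ids"],
                      labels=enc["input_ids"])
            loss = loss + out.loss
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend the LM loss: make the corpus more likely
            delta.clamp_(-eps, eps)              # keep the perturbation in an L_inf ball
            delta.grad = None
    return (base + delta).clamp(0, 1).detach()
```
The key property reported in the abstract is universality: one image optimized over the whole corpus elicits harmful instructions and content well outside the corpus used during optimization.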
Related papers
- Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks [34.40254709148148]
Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding.
Their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks.
We present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on multi-modal semantic updates.
arXiv Detail & Related papers (2024-11-24T05:28:07Z)
- A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerabilities of LVLMs, however, remain relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z)
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
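Read as a caption-and-regenerate consistency check, the summary above suggests one illustrative realization, sketched below; the captioning, text-to-image, and embedding-encoder interfaces, as well as the threshold, are assumptions rather than the paper's exact pipeline.
```python
# Illustrative caption-and-regenerate check: caption the input with the target VLM,
# regenerate an image from that caption with a T2I model, and compare the two images
# in an embedding space. The caption()/generate()/encoder calls and the threshold
# are hypothetical placeholders.
import torch
import torch.nn.functional as F

def looks_adversarial(image, vlm, t2i, image_encoder, threshold=0.7):
    """Return True if the input image is flagged as a likely adversarial sample."""
    caption = vlm.caption(image)          # hypothetical captioning call on the target VLM
    regenerated = t2i.generate(caption)   # hypothetical text-to-image call
    with torch.no_grad():
        z_in = F.normalize(image_encoder(image), dim=-1)
        z_re = F.normalize(image_encoder(regenerated), dim=-1)
    cosine = (z_in * z_re).sum(dim=-1)
    # A clean image's caption usually regenerates into something semantically close;
    # an adversarially perturbed image yields a mismatched caption, so the
    # regenerated image diverges and similarity drops below the threshold.
    return bool((cosine < threshold).item())
```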
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerabilities within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors [31.383591942592467]
Vision-language models (VLMs) offer innovative ways to combine visual and textual data for enhanced understanding and interaction.
Patch-based adversarial attacks are considered the most realistic threat model in physical vision applications.
We introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, to protect VLMs from the threat of patched visual prompt injectors.
arXiv Detail & Related papers (2024-05-17T04:19:19Z)
- Adversarial Robustness for Visual Grounding of Multimodal Large Language Models [49.71757071535619]
Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks.
The adversarial robustness of visual grounding, however, remains unexplored in MLLMs.
We propose three adversarial attack paradigms to evaluate it.
arXiv Detail & Related papers (2024-05-16T10:54:26Z)
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks [5.860289498416911]
Large Language Models (LLMs) are swiftly advancing in architecture and capability.
As they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows.
This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs.
arXiv Detail & Related papers (2023-10-16T21:37:24Z)
- On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting.
In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP.
Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z)
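The targeted evasion described in the entry above can be illustrated by a simple embedding-matching attack against a CLIP-style surrogate encoder: perturb the image so its embedding approaches that of an attacker-chosen caption, then transfer the result to the victim VLM. The encode_image/encode_text interface, the omission of the surrogate's input normalization, and the hyperparameters below are simplifying assumptions.
```python
# Illustrative targeted embedding-matching attack against a CLIP-style surrogate:
# push the image embedding toward the embedding of a target caption, then transfer
# the perturbed image to a victim VLM. Interfaces and hyperparameters are assumed;
# the surrogate's usual input normalization is omitted for brevity.
import torch
import torch.nn.functional as F

def targeted_embedding_attack(image, target_tokens, surrogate,
                              steps=200, eps=8 / 255, alpha=1 / 255):
    delta = torch.zeros_like(image, requires_grad=True)
    with torch.no_grad():
        target = F.normalize(surrogate.encode_text(target_tokens), dim=-1)

    for _ in range(steps):
        emb = F.normalize(surrogate.encode_image((image + delta).clamp(0, 1)), dim=-1)
        loss = -(emb * target).sum()         # maximize cosine similarity to the target caption
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)          # stay within the L_inf perturbation budget
            delta.grad = None
    return (image + delta).clamp(0, 1).detach()
```
The entry additionally reports that black-box queries to the victim VLM can further refine such transfer attacks; that refinement step is not shown in this sketch.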