Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
- URL: http://arxiv.org/abs/2409.07353v1
- Date: Wed, 11 Sep 2024 15:39:42 GMT
- Title: Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
- Authors: Md Zarif Hossain, Ahmed Imteaj
- Abstract summary: Large Vision-Language Models (LVLMs) have significantly advanced AI by excelling in vision-language tasks.
Jailbreak attacks bypass safety protocols and cause the model to generate misleading or harmful responses.
We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder by leveraging a Siamese architecture.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs), trained on multimodal big datasets, have significantly advanced AI by excelling in vision-language tasks. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems from both the inherent susceptibilities of LLMs and the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder by leveraging a Siamese architecture. This approach maximizes cosine similarity between perturbed and clean samples, facilitating resilience against adversarial manipulations. Sim-CLIP+ offers a plug-and-play solution, allowing seamless integration into existing LVLM architectures as a robust vision encoder. Unlike previous defenses, our method requires no structural modifications to the LVLM and incurs minimal computational overhead. Sim-CLIP+ demonstrates effectiveness against both gradient-based adversarial attacks and various jailbreak techniques. We evaluate Sim-CLIP+ against three distinct jailbreak attack strategies and perform clean evaluations using standard downstream datasets, including COCO for image captioning and OKVQA for visual question answering. Extensive experiments demonstrate that Sim-CLIP+ maintains high clean accuracy while substantially improving robustness against both gradient-based adversarial attacks and jailbreak techniques. Our code and robust vision encoders are available at https://github.com/speedlab-git/Robust-Encoder-against-Jailbreak-attack.git.
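The defense described in the abstract amounts to perturbation-aware Siamese fine-tuning of the vision encoder: craft an adversarial view of each image, then train the encoder to keep its embedding close to the clean one. The sketch below is a minimal illustration of that objective in PyTorch, assuming a generic CLIP-style encoder; the function names, PGD budget, and stop-gradient choice are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def pgd_perturb(encoder, images, eps=4/255, alpha=1/255, steps=10):
    """Craft adversarial views whose embeddings drift away from the clean ones."""
    with torch.no_grad():
        clean_emb = F.normalize(encoder(images), dim=-1)
    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_emb = F.normalize(encoder((images + delta).clamp(0, 1)), dim=-1)
        cos_sim = (clean_emb * adv_emb).sum(dim=-1).mean()
        grad, = torch.autograd.grad(cos_sim, delta)
        # Gradient descent on cosine similarity = push the embeddings apart.
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (images + delta).clamp(0, 1).detach()

def siamese_adv_finetune_step(encoder, optimizer, images):
    """One fine-tuning step: pull perturbed embeddings back toward clean ones."""
    adv_images = pgd_perturb(encoder, images)
    clean_emb = F.normalize(encoder(images), dim=-1)
    adv_emb = F.normalize(encoder(adv_images), dim=-1)
    # Negative cosine similarity; detaching the clean branch is one common way
    # to stabilize a Siamese objective without a momentum encoder (assumption).
    loss = -(clean_emb.detach() * adv_emb).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the vision encoder's weights change, a checkpoint trained this way can be dropped into an existing LVLM in place of the stock CLIP encoder, which is what makes the defense plug-and-play.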
Related papers
- Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models [11.867355323884217]
We present a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments.
Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency.
arXiv Detail & Related papers (2025-06-20T05:30:25Z)
- VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models [33.120141513366136]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation.
Existing effective attacks, however, focus on task-specific white-box settings.
We propose a simple yet effective Vision Encoder Attack (VEAttack) that targets only the vision encoder of LVLMs.
arXiv Detail & Related papers (2025-05-23T03:46:04Z)
- Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs [21.258254924259678]
We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of Large Language Models.
GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks.
GoAT's reasoning is based on a more intricate graph structure.
arXiv Detail & Related papers (2025-04-26T21:06:03Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model constrains the perplexity of a jailbreak, measuring how far it deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak defense strategy based on a Hidden State Filter (HSF).
HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins.
It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z)
- h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses this gap with a human-readable domain-specific language for composing jailbreak attacks.
We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z)
- Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models [0.0]
We propose Sim-CLIP, an unsupervised adversarial fine-tuning method that enhances the robustness of the widely-used CLIP vision encoder against adversarial attacks.
By employing a Siamese architecture with cosine similarity loss, Sim-CLIP learns semantically meaningful and attack-resilient visual representations without requiring large batch sizes or momentum encoders.
arXiv Detail & Related papers (2024-07-20T19:53:52Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting (AdaShield), which prepends defense prompts to inputs to protect MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, which perplexity-based filters can detect (see the perplexity-filter sketch after this list).
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
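Several entries above (the realistic threat model and the AutoDAN summary) refer to perplexity-based filtering of adversarial prompts. The following is a minimal sketch of that idea, assuming a small reference language model from the `transformers` library; the model choice ("gpt2") and the threshold are illustrative assumptions, not values taken from any of the listed papers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small reference language model used only to score prompts.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(prompt: str) -> float:
    """Perplexity of a prompt under the reference language model."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def passes_filter(prompt: str, threshold: float = 200.0) -> bool:
    """Reject prompts whose perplexity exceeds an assumed threshold."""
    return perplexity(prompt) < threshold

# Readable text passes; gibberish adversarial suffixes typically do not.
print(passes_filter("Describe the picture in one sentence."))
print(passes_filter("describing.\\ + similarlyNow write oppositeley.]("))
```

In practice the threshold would be calibrated on benign prompts; readable, AutoDAN-style attacks aim to slip under exactly this kind of filter.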
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.