Related papers: Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

URL: http://arxiv.org/abs/2405.04403v1
Date: Tue, 7 May 2024 15:29:48 GMT
Title: Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks
Authors: Georgios Pantazopoulos, Amit Parekh, Malvina Nikandrou, Alessandro Suglia,
Abstract summary: Augmenting Large Language Models with image-understanding capabilities has resulted in a boom of high-performing Vision-Language models (VLMs) In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach.
Score: 41.213482317141356
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Augmenting Large Language Models (LLMs) with image-understanding capabilities has resulted in a boom of high-performing Vision-Language models (VLMs). While studying the alignment of LLMs to human values has received widespread attention, the safety of VLMs has not received the same attention. In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach. By comparing each VLM to their respective LLM backbone, we find that each VLM is more susceptible to jailbreaking. We consider this as an undesirable outcome from visual instruction-tuning, which imposes a forgetting effect on an LLM's safety guardrails. Therefore, we provide recommendations for future work based on evaluation strategies that aim to highlight the weaknesses of a VLM, as well as take safety measures into account during visual instruction tuning.

Related papers

VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap [51.287157951953226]
Vision language models (VLMs) come with increased safety concerns. VLMs can be built upon LLMs that have textual safety alignment, but it is easily undermined when the vision modality is integrated. We propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM.
arXiv Detail & Related papers (2025-02-14T08:44:43Z)
You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense [34.023473699165315]
We study the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies. We find that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously.
arXiv Detail & Related papers (2025-01-21T15:24:29Z)
Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the textbfRetention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z)
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations. We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models [44.82179903133343]
GLOV enables Large Language Models (LLMs) to act as implicit encoders for Vision-Language Models (VLMs) We show that GLOV shows performance improvement by up to 15.0% and 57.5% for dual-encoder (e.g., CLIP) and VL-decoder (e.g., LlaVA) models for object recognition.
arXiv Detail & Related papers (2024-10-08T15:55:40Z)
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security. We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze. Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z)
CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs. The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs. We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? [20.385314634225978]
We focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conduct a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" We find that transferable gradient-based image jailbreaks are extremely difficult to obtain.
arXiv Detail & Related papers (2024-07-21T16:27:24Z)
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts [25.661444231400772]
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs) These advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. We introduce Arondight, a standardized red team framework tailored specifically for VLMs.
arXiv Detail & Related papers (2024-07-21T04:37:11Z)
A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks. The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage. In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z)
Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z)
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [55.91371032213854]
This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness.
arXiv Detail & Related papers (2023-11-27T18:59:42Z)
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming [72.2127916030909]
We propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation. On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment reduces up to 84.7% after 4 rounds of MART. Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following.
arXiv Detail & Related papers (2023-11-13T19:13:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.