Related papers: Refusing Safe Prompts for Multi-modal Large Language Models

URL: http://arxiv.org/abs/2407.09050v2
Date: Thu, 5 Sep 2024 21:17:13 GMT
Title: Refusing Safe Prompts for Multi-modal Large Language Models
Authors: Zedian Shao, Hongbin Liu, Yuepeng Hu, Neil Zhenqiang Gong
Abstract summary: We introduce MLLM-Refusal, the first method that induces refusals for safe prompts. We formulate MLLM-Refusal as a constrained optimization problem and propose an algorithm to solve it. We evaluate MLLM-Refusal on four MLLMs across four datasets.
Score: 36.276781604895454
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have become the cornerstone of today's generative AI ecosystem, sparking intense competition among tech giants and startups. In particular, an MLLM generates a text response given a prompt consisting of an image and a question. While state-of-the-art MLLMs use safety filters and alignment techniques to refuse unsafe prompts, in this work, we introduce MLLM-Refusal, the first method that induces refusals for safe prompts. In particular, our MLLM-Refusal optimizes a nearly-imperceptible refusal perturbation and adds it to an image, causing target MLLMs to likely refuse a safe prompt containing the perturbed image and a safe question. Specifically, we formulate MLLM-Refusal as a constrained optimization problem and propose an algorithm to solve it. Our method offers competitive advantages for MLLM model providers by potentially disrupting user experiences of competing MLLMs, since competing MLLMs' users will receive unexpected refusals when they unwittingly use these perturbed images in their prompts. We evaluate MLLM-Refusal on four MLLMs across four datasets, demonstrating its effectiveness in causing competing MLLMs to refuse safe prompts while not affecting non-competing MLLMs. Furthermore, we explore three potential countermeasures: adding Gaussian noise, DiffPure, and adversarial training. Our results show that though they can mitigate MLLM-Refusal's effectiveness, they also sacrifice the accuracy and/or efficiency of the competing MLLM. The code is available at https://github.com/Sadcardation/MLLM-Refusal.
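
Illustrative sketch (not the authors' algorithm): the abstract describes optimizing a small perturbation under a constraint so that a model becomes likely to refuse. The snippet below shows one generic way such a constrained problem can be approached, projected sign-gradient ascent under an L-infinity bound, on a toy logistic "refusal score". The surrogate model, the names refusal_score and optimize_perturbation, and the values of eps, step, and iters are all assumptions made for illustration; the paper's actual objective and solver are in the linked repository.

```python
import numpy as np

def refusal_score(image, w, b):
    """Toy surrogate (assumption): probability that a hypothetical model refuses."""
    return 1.0 / (1.0 + np.exp(-(image.ravel() @ w + b)))

def optimize_perturbation(image, w, b, eps=8 / 255, step=1 / 255, iters=50):
    """Projected sign-gradient ascent on the toy refusal score.

    Constraints: ||delta||_inf <= eps and image + delta stays in [0, 1].
    """
    delta = np.zeros_like(image)
    for _ in range(iters):
        p = refusal_score(image + delta, w, b)
        # Gradient of sigmoid(w.x + b) with respect to x is p * (1 - p) * w.
        grad = (p * (1.0 - p) * w).reshape(image.shape)
        delta = delta + step * np.sign(grad)
        # Project back onto the L-infinity ball of radius eps.
        delta = np.clip(delta, -eps, eps)
        # Keep the perturbed image inside the valid pixel range.
        delta = np.clip(image + delta, 0.0, 1.0) - image
    return delta

# Hypothetical data: a random 8x8 RGB "image" and random surrogate weights.
rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(8, 8, 3))
w = rng.normal(size=image.size)
b = -3.0

delta = optimize_perturbation(image, w, b)
before = refusal_score(image, w, b)
after = refusal_score(image + delta, w, b)
print(f"refusal score before: {before:.4f}, after: {after:.4f}")
```

The perturbation stays within the eps budget by construction, which is the "nearly-imperceptible" part of the setup; a real MLLM would replace the logistic surrogate with a loss over the model's generated refusal text.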

Related papers

Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time [39.97820478987012]
We introduce a novel method for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Experimental results across various tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2025-09-15T23:55:57Z)
Towards Harmless Multimodal Assistants with Blind Preference Optimization [49.044737689613164]
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. We construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback.
arXiv Detail & Related papers (2025-03-18T12:02:38Z)
Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models [49.48313161005423]
A hybrid language model (HLM) architecture integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM. We propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both up
arXiv Detail & Related papers (2024-12-17T09:08:18Z)
CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
Multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs. The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs. We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU [14.719538667881311]
Inf-MLLM is an efficient inference framework for Multimodal Large Language Models (MLLMs). We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token long texts and multi-round conversations with 1-hour-long videos on a single GPU.
arXiv Detail & Related papers (2024-09-11T12:44:12Z)
Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning [68.83624133567213]
We show that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question. We also propose a simple yet effective method, Active Deduction (AD), to encourage the model to actively perform composite deduction.
arXiv Detail & Related papers (2024-04-19T15:53:27Z)
FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs [57.59518049930211]
We propose the first adversarial attack tailored for video-based large language models (LLMs). Our attack can effectively induce video-based LLMs to generate incorrect answers when videos are added with imperceptible adversarial perturbations. Our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate.
arXiv Detail & Related papers (2024-03-20T11:05:07Z)
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protecting approach that exploits the inherent safety awareness of MLLMs. ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z)
The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative [55.08395463562242]
Multimodal Large Language Models (MLLMs) are constantly defining the new boundary of Artificial General Intelligence (AGI). Our paper explores a novel vulnerability in MLLM societies: the indirect propagation of malicious content.
arXiv Detail & Related papers (2024-02-20T23:08:21Z)
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [36.03512474289962]
This paper investigates the novel challenge of defending MLLMs against malicious attacks through visual inputs. Images act as a "foreign language" that is not considered during safety alignment, making MLLMs more prone to producing harmful responses. We introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier.
arXiv Detail & Related papers (2024-01-05T17:05:42Z)
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [41.708401515627784]
We observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images. We introduce MM-SafetyBench, a framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits.
arXiv Detail & Related papers (2023-11-29T12:49:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.