Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
- URL: http://arxiv.org/abs/2501.01042v2
- Date: Fri, 10 Jan 2025 09:21:43 GMT
- Title: Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
- Authors: Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng
- Abstract summary: Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks.
In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs.
- Abstract: Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models, a common and practical real-world scenario, remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. In addition, a perturbation propagation technique is introduced to handle different unknown frame-sampling strategies. Experimental results demonstrate that our method generates adversarial examples with strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as the surrogate model) achieve competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA, respectively, for VideoQA tasks. Our code will be released upon acceptance.
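The core idea in the abstract can be pictured as an iterative feature-space attack mounted through a surrogate image model. Below is a minimal sketch under stated assumptions: a frozen CLIP/BLIP-2-style image encoder as the surrogate IMM, a plain cosine-similarity objective, and illustrative epsilon, step size, and function names. The paper's multimodal interaction terms and its perturbation-propagation technique are not reproduced here.

```python
# Minimal sketch of the I2V-MLLM idea: perturb every frame so that its latent
# features under a surrogate image-based model (IMM) drift away from the clean
# features, disrupting any V-MLLM that relies on a similar visual encoder.
# Encoder choice, budget, step size, and the plain cosine loss are assumptions.
import torch
import torch.nn.functional as F

def i2v_mllm_sketch(frames, image_encoder, eps=8 / 255, alpha=1 / 255, steps=40):
    """frames: (T, C, H, W) video clip in [0, 1]; image_encoder: frozen surrogate IMM."""
    with torch.no_grad():
        clean = F.normalize(image_encoder(frames), dim=-1)   # (T, D) clean latents
    delta = torch.zeros_like(frames, requires_grad=True)
    for _ in range(steps):
        adv = F.normalize(image_encoder(frames + delta), dim=-1)
        loss = (adv * clean).sum(dim=-1).mean()              # mean cosine similarity
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()               # descend on similarity
            delta.clamp_(-eps, eps)                          # L_inf imperceptibility budget
            delta.copy_((frames + delta).clamp(0, 1) - frames)  # keep pixels in [0, 1]
        delta.grad.zero_()
    return (frames + delta).detach()                         # adversarial video
```

Because every frame is perturbed under one shared budget, whichever frames an unseen V-MLLM happens to sample are already adversarial, which matches the intuition behind handling unknown frame-sampling strategies.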
Related papers
- Universal Adversarial Attack on Aligned Multimodal LLMs [1.5146068448101746]
We propose a universal adversarial attack on multimodal Large Language Models (LLMs).
We craft a synthetic image that forces the model to respond with a targeted phrase or otherwise unsafe content; a hedged sketch of this optimization appears after this entry.
We will release code and datasets under the Apache-2.0 license.
arXiv Detail & Related papers (2025-02-11T22:07:47Z)
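A targeted attack of this kind is typically framed as optimizing image pixels to maximize the likelihood of an attacker-chosen phrase under a white-box model. The sketch below is speculative: `vlm.target_loss` is a hypothetical helper (not a real API from this paper or any library), and an HF-style tokenizer is assumed.

```python
# Speculative sketch: optimize an image so a (white-box) multimodal LLM assigns
# high likelihood to a chosen target phrase. `vlm.target_loss` is a hypothetical
# stand-in returning the teacher-forced cross-entropy of the target tokens
# given (image, prompt); it is not the paper's actual interface.
import torch

def force_phrase(vlm, tokenizer, image, prompt, target_phrase, steps=500, lr=1e-2):
    target_ids = tokenizer(target_phrase, return_tensors="pt").input_ids
    adv = image.clone().detach().requires_grad_(True)   # optimize pixels directly
    optimizer = torch.optim.Adam([adv], lr=lr)
    for _ in range(steps):
        loss = vlm.target_loss(adv, prompt, target_ids)  # hypothetical helper
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            adv.clamp_(0, 1)                             # keep a valid image
    return adv.detach()
```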
- 'Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs [6.151779089440453]
We introduce the first voice-based jailbreak attack against multimodal large language models (LLMs).
We propose a novel strategy, in which the disallowed prompt is flanked by benign, narrative-driven prompts.
We demonstrate that the Flanking Attack can manipulate state-of-the-art LLMs into generating misaligned and forbidden outputs.
arXiv Detail & Related papers (2025-02-02T10:05:08Z)
- AnyAttack: Targeted Adversarial Attacks on Vision-Language Models toward Any Images [41.044385916368455]
We propose AnyAttack, a self-supervised framework that generates targeted adversarial images for Vision-Language Models without label supervision.
Our framework employs the pre-training and fine-tuning paradigm, with the adversarial noise generator pre-trained on the large-scale LAION-400M dataset; a speculative sketch of such a training step appears after this entry.
arXiv Detail & Related papers (2024-10-07T09:45:18Z)
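One way to read "self-supervised, targeted, label-free" is: the target image's own embedding serves as the supervision signal for the noise generator. The sketch below is an assumption based only on the abstract; the generator's signature, the frozen encoder, and the cosine loss are illustrative, not AnyAttack's actual code or LAION-400M pipeline.

```python
# Speculative sketch of a self-supervised training step for a targeted noise
# generator: the generated perturbation pulls the adversarial image's embedding
# toward an arbitrary target image's embedding -- no class labels required.
import torch
import torch.nn.functional as F

def anyattack_style_step(generator, image_encoder, clean, target, optimizer, eps=8 / 255):
    """generator(clean, target_emb) is an assumed signature; image_encoder is frozen."""
    with torch.no_grad():
        target_emb = F.normalize(image_encoder(target), dim=-1)   # self-supervised "label"
    noise = generator(clean, target_emb).clamp(-eps, eps)         # bounded perturbation
    adv_emb = F.normalize(image_encoder((clean + noise).clamp(0, 1)), dim=-1)
    loss = 1.0 - (adv_emb * target_emb).sum(dim=-1).mean()        # pull toward target image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```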
- Failures to Find Transferable Image Jailbreaks Between Vision-Language Models [20.385314634225978]
We focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs.
We find that transferable gradient-based image jailbreaks are extremely difficult to obtain.
arXiv Detail & Related papers (2024-07-21T16:27:24Z)
- A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation; a generic wiring sketch appears after this entry.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
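Integrating a vision encoder with an LLM for a single multimodal item embedding commonly follows a projection-plus-soft-token pattern. The sketch below shows that generic pattern, not NoteLLM-2's actual architecture; the projection, pooling choice, and HF-style `inputs_embeds` interface are all assumptions.

```python
# Generic sketch: project frozen vision features into the LLM's embedding space,
# prepend them as soft tokens, and pool the LLM's output into one item embedding
# (e.g., for item-to-item recommendation). Names and wiring are illustrative.
import torch
import torch.nn as nn

class MultimodalItemEncoder(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder        # any pre-trained vision backbone
        self.llm = llm                              # HF-style decoder accepting inputs_embeds
        self.proj = nn.Linear(vision_dim, llm_dim)  # map image features into LLM space

    def forward(self, image, text_embeds):
        # Assumes vision_encoder returns (B, N, vision_dim) patch features.
        img_tokens = self.proj(self.vision_encoder(image))         # (B, N, llm_dim)
        inputs = torch.cat([img_tokens, text_embeds], dim=1)       # soft tokens + text
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state  # (B, N+T, llm_dim)
        return hidden[:, -1]                                       # last state as item embedding
```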
- FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs [57.59518049930211]
We propose the first adversarial attack tailored for video-based large language models (LLMs).
Our attack effectively induces video-based LLMs to generate incorrect answers when imperceptible adversarial perturbations are added to the input videos.
Our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate.
arXiv Detail & Related papers (2024-03-20T11:05:07Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short of comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- Cross-Modal Transferable Adversarial Attacks from Images to Videos [82.0745476838865]
Recent studies have shown that adversarial examples hand-crafted on one white-box model can be used to attack other black-box models.
We propose a simple yet effective cross-modal attack method, named the Image To Video (I2V) attack.
I2V generates adversarial frames by minimizing the cosine similarity between the features that pre-trained image models extract from adversarial and benign examples; the objective is written out after this entry.
arXiv Detail & Related papers (2021-12-10T08:19:03Z)
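Read literally, the stated I2V objective can be summarized per frame as follows, where f is a frozen pre-trained image feature extractor, x a benign frame, delta the perturbation, and epsilon an L-infinity budget (notation ours, inferred from the summary above):

```latex
\min_{\delta}\; \cos\big(f(x+\delta),\, f(x)\big)
\qquad \text{s.t.}\qquad \lVert \delta \rVert_\infty \le \epsilon
```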