Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
- URL: http://arxiv.org/abs/2404.16557v1
- Date: Thu, 25 Apr 2024 12:11:38 GMT
- Title: Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
- Authors: Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li
- Abstract summary: In this paper, we aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation.
We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences.
Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.
- Score: 63.9198662100875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. Once malicious users induce high energy consumption and latency (energy-latency cost), it can exhaust computational resources and harm the availability of the service. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation. We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality-non-specific losses are proposed: a loss to delay the end-of-sequence (EOS) token and an uncertainty loss to increase the uncertainty over each generated token. In addition, improving diversity is important for encouraging longer responses by increasing complexity, which inspires the following modality-specific losses. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.
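The abstract's recipe can be grounded with a small sketch: an EOS-delay loss that suppresses the end-of-sequence probability at every decoding step, an uncertainty loss that raises per-token entropy, and a PGD-style update that keeps the perturbation imperceptible. The surrogate `logits_fn`, the fixed loss weights, the EOS id, and the L-infinity budget below are illustrative assumptions (the paper balances the losses with a temporal weight adjustment algorithm and attacks real MLLMs), not the authors' implementation.

```python
# Minimal sketch of the two modality-non-specific losses plus a PGD-style
# update of an imperceptible image perturbation. All constants are assumed.
import torch
import torch.nn.functional as F

EOS_ID = 2          # assumed end-of-sequence token id
EPSILON = 8 / 255   # assumed L-infinity budget for imperceptibility
STEP = 1 / 255      # assumed PGD step size

def eos_delay_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean EOS probability over decoding steps; minimizing it delays EOS.
    logits: (steps, vocab_size) next-token logits from the model."""
    return F.softmax(logits, dim=-1)[:, EOS_ID].mean()

def uncertainty_loss(logits: torch.Tensor) -> torch.Tensor:
    """Negative mean token entropy; minimizing it raises per-token uncertainty."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return -entropy.mean()

def pgd_step(image, delta, logits_fn, w_eos=1.0, w_unc=1.0):
    """One projected-gradient step on the additive perturbation `delta`.
    Fixed weights stand in for the paper's temporal weight adjustment."""
    delta = delta.clone().detach().requires_grad_(True)
    logits = logits_fn(image + delta)
    loss = w_eos * eos_delay_loss(logits) + w_unc * uncertainty_loss(logits)
    loss.backward()
    with torch.no_grad():
        delta = delta - STEP * delta.grad.sign()      # descend the combined loss
        delta = delta.clamp(-EPSILON, EPSILON)        # stay within the budget
        delta = (image + delta).clamp(0, 1) - image   # keep a valid image
    return delta.detach()

if __name__ == "__main__":
    # Toy surrogate: a random linear map standing in for an MLLM decoder.
    torch.manual_seed(0)
    steps, vocab = 8, 500
    image = torch.rand(3, 32, 32)
    proj = torch.randn(image.numel(), steps * vocab) * 1e-2
    logits_fn = lambda x: (x.flatten() @ proj).view(steps, vocab)
    delta = torch.zeros_like(image)
    for _ in range(10):
        delta = pgd_step(image, delta, logits_fn)
    print("EOS prob after attack:",
          eos_delay_loss(logits_fn(image + delta)).item())
```

In the full method, the modality-specific diversity losses (token diversity for verbose images, frame feature diversity for verbose videos) would be added to the same objective before the gradient step.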
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy.
Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation.
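For context, post-training quantization methods like this start from the standard uniform affine quantization of each layer's weights; the sketch below shows only that generic baseline (bit-width, per-channel grouping, and the error metric are assumptions for illustration) and does not reproduce Q-VLM's cross-layer dependency mining.

```python
# Generic per-channel uniform affine weight quantization of one linear layer.
import torch

def quantize_weight(w: torch.Tensor, n_bits: int = 4):
    """Quantize a (out_features, in_features) weight matrix per output channel."""
    qmin, qmax = 0, 2 ** n_bits - 1
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    w_hat = (q - zero_point) * scale            # dequantized weights
    return w_hat, (w_hat - w).pow(2).mean()     # discretization error

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(256, 512)                   # toy layer weights
    w_hat, err = quantize_weight(w, n_bits=4)
    print(f"mean squared discretization error: {err:.6f}")
```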
arXiv Detail & Related papers (2024-10-10T17:02:48Z) - Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving [9.900979396513687]
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems.
One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information.
We propose Video Token Sparsification (VTS) to significantly reduce the total number of visual tokens while preserving the most salient information.
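A generic form of visual-token sparsification can be sketched as ranking tokens by a saliency proxy and keeping only the top-k before they reach the LLM; the feature-norm score and keep ratio below are assumptions for illustration, not the paper's exact criterion.

```python
# Keep the top-k visual tokens per frame according to a saliency proxy.
import torch

def sparsify_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25):
    """tokens: (num_tokens, dim) visual tokens for one frame or image."""
    k = max(1, int(tokens.shape[0] * keep_ratio))
    saliency = tokens.norm(dim=-1)                      # proxy importance score
    keep_idx = saliency.topk(k).indices.sort().values   # preserve token order
    return tokens[keep_idx]

if __name__ == "__main__":
    torch.manual_seed(0)
    frame_tokens = torch.randn(576, 1024)               # e.g. 24x24 patch tokens
    kept = sparsify_tokens(frame_tokens, keep_ratio=0.25)
    print(kept.shape)                                   # torch.Size([144, 1024])
```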
arXiv Detail & Related papers (2024-09-16T05:31:01Z) - Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection [9.145305176998447]
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities.
We propose a new weakly supervised MVD method that explicitly addresses the challenges of information redundancy, modality imbalance, and modality asynchrony.
Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-05-08T15:27:08Z) - Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images [63.91986621008751]
Large vision-language models (VLMs) have achieved exceptional performance across various multi-modal tasks.
In this paper, we aim to induce high energy-latency cost during inference of VLMs.
We propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences.
arXiv Detail & Related papers (2024-01-20T08:46:06Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models [12.947537874888717]
Varied inference latency, identified as a consequence of the uncertainty intrinsic to language, can lead to computational inefficiency.
We present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs.
We show that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.
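One generic way to make resource management uncertainty-aware is to treat an uncertainty score as a proxy for expected decoding length and serve more certain (likely shorter) requests first, the shortest-job-first idea behind lowering average response time; the score and queue below are illustrative assumptions, not RT-LM's actual system.

```python
# Serve pending LM requests in order of an uncertainty score (lower first).
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    uncertainty: float                 # e.g. mean token entropy of a prefix
    prompt: str = field(compare=False)

class UncertaintyAwareQueue:
    def __init__(self) -> None:
        self._heap: list[Request] = []

    def submit(self, req: Request) -> None:
        heapq.heappush(self._heap, req)

    def next_request(self) -> Request:
        return heapq.heappop(self._heap)   # most certain (likely shortest) first

if __name__ == "__main__":
    q = UncertaintyAwareQueue()
    q.submit(Request(uncertainty=2.1, prompt="summarize this report ..."))
    q.submit(Request(uncertainty=0.4, prompt="what is 2 + 2?"))
    print(q.next_request().prompt)         # the low-uncertainty request runs first
```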
arXiv Detail & Related papers (2023-09-12T22:22:10Z) - Multi-Granularity Network with Modal Attention for Dense Affective Understanding [11.076925361793556]
In the recent EEV challenge, a dense affective understanding task is proposed and requires frame-level affective prediction.
We propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for better description of the target frame.
The proposed method achieves a correlation score of 0.02292 in the EEV challenge.
arXiv Detail & Related papers (2021-06-18T07:37:06Z)